Introduction to Open Data Science - Course Project

Chapter 1: Start me up!

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 26 15:23:03 2023"

The text continues here.

1. Some thoughts about this course

e.g., How are you feeling right now? What do you expect to learn? Where did you hear about the course?

As an R beginner, I find this course very friendly. Both the teacher's instructions and the book are clear enough for me to understand and move on easily. The learning curve looks smooth, and the instructions are also encouraging; all of this gives a beginner confidence. I have been impressed by how versatile and powerful R is, so I want to add it to my research toolbox, although my research does not currently involve it (it may in the future, once I have the skill). In other words, learning R is an interest or hobby of mine at the moment. It is always good to learn something fresh in my spare time, and R is the latest! I heard about this course from a colleague who regularly makes impressive graphs from statistical data. He became really skillful with R after a year of learning, starting from this very course a year ago. His feedback on the course and his learning outcomes really encouraged me to get started with R here.

2. Reflection on the book and Exercise 1

How did it work as a “crash course” on modern R tools and using RStudio? Which were your favorite topics? Which topics were most difficult? Some other comments on the book and our new approach of getting started with R Markdown etc.?

This book is clear, instructive, and encouraging for an R beginner. I enjoyed the parts on data wrangling and visualization, which are accessible and readable. My favorite feature of R is the instant feedback after executing a command, whether right or wrong. This is what a researcher is eager for, but seldom gets in an academic career even after considerable effort. However, I found the data analysis challenging, since interpreting the R output requires specialized knowledge of statistics. It can be easy to run the analysis functions but not so easy to understand the meaning of the results. Therefore, there is more to learn as I move on to data analysis in R. The book provides plenty of instructions to work through and practice, and they are clear to follow. I think it is a very good starting point for R; I just need to put in the time.

3. Describe the work I have done this week and summarize my learning

Following the detailed instructions in “Start me up!”, I set up the technical platforms needed in this course: R, RStudio, and GitHub. Following the teacher’s instructions, I learned how to push updated scripts to GitHub. I then started learning R with the book “R for Health Data Science” and the corresponding Exercise 1. This week I finished Chapters 1-5 of the book, which was a heavy and time-consuming task, but rewarding! In addition, I read the recommended Part 1 of the textbook “MABS4IODS” and watched the two webinars on RStudio and GitHub.


Chapter 2: Regression and model validation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Sun Nov 26 15:23:03 2023"

Data analysis assignment

1 Read the data

Read the students2014 data into R and explore the structure and the dimensions of the data

students2014 <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt", sep=",", header=TRUE)

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Description: The data set consists of 166 observations of 7 variables: the gender, age, attitude, and exam points of 166 students, together with their responses to questions measuring deep, surface, and strategic learning.

2 Graphical overview with a plot matrix

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# For making a nicer plot
my_lower <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_point(..., alpha=1, pch = 21, color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_diag <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_density(..., alpha=0.7,  color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_discrete <- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_bar(..., alpha=1,  color ="black") +
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

my_upper_combo<- function(data, mapping, ...) {
  ggplot(data = data, mapping=mapping) +
    geom_boxplot(...,alpha=1, color ="black") +
    geom_jitter(...,alpha = 0.5, width = 0.25, size = 1, pch = 21, color = "black")+
    theme(
      panel.background = element_rect(fill = "white", color = "black")
    )
}

# Plot
p <- ggpairs(students2014, 
             mapping = aes(col = gender, fill = gender, alpha = 0.7), 
             lower = list(
               continuous = my_lower,
               combo = wrap("facethist", bins = 20, color = "black")
               ),
             diag = list(
               continuous = my_diag,
               discrete =  my_discrete
             ),
             upper = list(
               combo = my_upper_combo,
               continuous = wrap("cor", color = "black")
             )
             )+
  scale_fill_manual(values = c("lightblue", "wheat"))+
  scale_color_manual(values = c("lightblue", "wheat"))
  

# Print the plot
p

# Summaries of the variables in the data
summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Description: Most numeric variables in the data are roughly normally distributed, except for age, which is concentrated in the early twenties. There are about twice as many females as males. Among the variables, attitude towards statistics is the most strongly correlated with exam points (r=0.437).
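As a quick numeric check of the strongest relationship noted above, the pooled correlation can be computed directly (a sketch assuming `students2014` is loaded as in the chunk earlier):

```r
# Pearson correlation between attitude and exam points;
# the plot matrix reports r = 0.437 for the pooled data.
cor(students2014$attitude, students2014$points)
```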

3 Fit a regression model

Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.

# Choose attitude, stra, and surf as the three explanatory variables, because they have the highest (absolute) correlations with the target variable (exam points) according to the plot matrix.
model1 <- lm(points ~ attitude + stra + surf, data = students2014)

# Print out the summary of the first model
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# Remove the non-significant variables and fit the regression model again
model2 <- lm(points ~ attitude, data = students2014)

# Print out the summary of the fitted model
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Description: “attitude” was the only significant predictor in the model. The other two variables, “stra” and “surf”, had p-values of 0.12 and 0.47, respectively, so the model was re-fitted with “attitude” alone. The final model explains 19.06% (R-squared = 0.1906) of the variance in the response (exam points).

4 Interpret the fitted model

Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model.

summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Description: The fitted model shows that attitude is a meaningful predictor of exam points. Specifically:

- a. A one-unit increase in attitude is associated with an increase of about 3.53 exam points.

- b. The intercept of 11.637 is the model's expected exam score at an attitude of zero; since zero lies outside the observed range of attitude, it mainly anchors the regression line rather than describing a real student.

- c. The model explains a fairly small proportion (about 19%) of the variance in exam points. In other words, attitude alone is not a sufficient predictor of exam points, even though its influence should be recognized; random error and other factors also play a role.
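To make interpretations (a) and (b) concrete, here is a small worked example using the printed coefficients (the attitude value of 4 is hypothetical):

```r
# Predicted exam points = intercept + slope * attitude
intercept <- 11.6372
slope <- 3.5255
attitude <- 4  # hypothetical student
intercept + slope * attitude  # about 25.7 points
```

The same number comes from `predict(model2, newdata = data.frame(attitude = 4))`.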

5 Diagnostic plots

Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots.

par(mfrow = c(2,2))
plot(model2, which = c(1,2,5))

Description:

   a. The Residuals vs Fitted plot shows the data points scattered randomly around the dotted line y = 0, and the fitted (red) line is roughly horizontal without distinct patterns or trends, indicating a linear relationship. The linearity assumption of linear regression therefore appears to hold.
   
   b. The QQ plot shows that most points lie on the dashed straight line, except at the lower and upper ends, where some points deviate from it, suggesting the residual distribution may be slightly skewed. Nevertheless, the normality assumption can be considered approximately met, since with real data it is almost never perfectly satisfied.
   
   c. The Residuals vs Leverage plot shows no observations with large Cook's distances. If the plot did reveal such outliers, removing them from the data would be worth considering, because they would have a high influence on the linear model.
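The leverage check in (c) can also be done numerically (a sketch; `model2` is the fitted model from above, and the 4/n cutoff is just a common rule of thumb, not part of the exercise):

```r
cd <- cooks.distance(model2)   # Cook's distance for each observation
which(cd > 4 / length(cd))     # observations above the 4/n rule of thumb
max(cd)                        # largest single influence
```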

Chapter 3: Logistic Regression

date()
## [1] "Sun Nov 26 15:23:07 2023"

1 Read the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
# read the data
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
# print out the names of the variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 370  35
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

This data set contains information on the background, achievement, and alcohol consumption of 370 students in two Portuguese schools. There are 35 variables, covering student grades, demographic, social and school-related features, as well as performance in Mathematics (mat) and Portuguese language (por). The data were collected using school reports and questionnaires.

2 Hypothesis

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.

Based on my everyday observations and some research reports on teenagers’ alcohol consumption, I choose family relationships (“famrel”), number of school absences (“absences”), weekly study time (“studytime”), and frequency of going out with friends (“goout”) as candidate predictors of alcohol consumption. Hypothesis: a student belongs to the high-alcohol-consumption group if he or she (1) has a low-quality family relationship; (2) is absent from school more often; (3) studies less per week; and (4) goes out with friends more often.

3 The distribution of the chosen variables and their relationships with alcohol consumption

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.

3.1 The distribution of the chosen variables

library(ggplot2)
library(tidyr)

alc |> 
  select(absences, studytime, famrel, goout) |> 
  pivot_longer(everything(), 
               names_to = "variable", 
               values_to = "value") |> 
  ggplot(aes(x = value))+
  facet_wrap(~variable, scales = "free")+
  geom_bar()+
  labs(title = "Distribution of the interested variables",
       x  = "Values of each variable",
       y = "Frequency")

  1. The distribution of “absences” is right-skewed with a very long tail, indicating that most students have full or nearly full class attendance, while a small number of students miss quite a number of classes.
  2. The other three variables come from Likert-type items on small integer scales (roughly 1 to 5). For family relationship quality, most students report good or very good relationships, with roughly two thirds at the high end of the scale.
  3. Most students appear quite social, spending a lot of time going out with friends.
  4. For study time, most students spend 2 to 5 hours a week studying, and very few spend more than 10 hours a week.

3.2 The relationships between interesting variables and alcohol consumption

# The relationship between students' absences of class and alcohol consumption (box plot for numerical~categorical variables)
p1 <- alc |> 
  ggplot(aes(x = high_use, y = absences)) +
  geom_boxplot() +
  geom_jitter(width=0.25, alpha=0.5)+
  labs(x = "Alcohol consumption", 
       y = "Frequency of class absences",
       title = 
         "Frequency of class absences and alcohol consumption")+
  scale_x_discrete(labels = c("FALSE" = "Non-high-user", 
                              "TRUE" = "high-user"))
p1 

# The relationship between quality of family relation and alcohol consumption (bar plot for categorical variables)
p2 <- alc |> 
  ggplot(aes(x = factor(famrel), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Quality of family relationships", 
       y = "Proportion of alcohol high-users",
       title = 
         "Quality of family relationships and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p2

# The relationship between going out with friends and alcohol consumption (bar plot for categorical variables)
p3 <- alc |> 
  ggplot(aes(x = factor(goout), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Going out with friends", 
       y = "Proportion of alcohol high-users",
       title = 
         "Going out with friends and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p3

# The relationship between weekly study time and alcohol consumption (bar plot for categorical variables)
p4 <- alc |> 
  ggplot(aes(x = factor(studytime), fill = high_use)) +
  geom_bar(position = "fill", color = "black") +
  labs(x = "Weekly study time", 
       y = "Proportion of alcohol high-users",
       title = 
         "Weekly study time and alcohol consumption")+
  guides(fill=guide_legend("Alcohol consumption"))+
  scale_fill_discrete(labels = c("FALSE" = "Non-high-user", 
                                 "TRUE" = "high-user"))
p4

# combining four plots together
library(patchwork)
p1 + p2 + p3 + p4

Four plots explore the relationships between the four chosen variables and alcohol consumption. (1) The box plot shows that the distribution (median, Q1, Q3) of class absences differs greatly between alcohol high-users and non-high-users: high alcohol use is associated with more class absences, as hypothesized. (2) The first bar plot shows a result that differs from the hypothesis regarding family relationship quality: students with the worst family relationships are not the group with the highest consumption; rather, the heaviest consumers are those with poor-to-middling family relationships. (3) The second bar plot shows that the more frequently students go out with friends, the higher their alcohol consumption, as hypothesized. (4) The third bar plot shows that the more time students spend studying each week, the lower their alcohol consumption, as hypothesized.

4 Logistic regression

Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. Hint: If your model includes factor variables, see for example the RHDS book or the first answer of this stack exchange thread on how R treats and how you should interpret these variables in the model output (or use some other resource to study this).

# Fit the model based on the original hypothesis
model <- glm(high_use ~ absences + famrel + goout + studytime, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences + famrel + goout + studytime, 
##     family = "binomial", data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.15221    0.71444  -1.613 0.106798    
## absences     0.06655    0.02192   3.036 0.002394 ** 
## famrel      -0.35898    0.13796  -2.602 0.009266 ** 
## goout        0.75990    0.12143   6.258 3.91e-10 ***
## studytime   -0.57194    0.16963  -3.372 0.000747 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.36  on 365  degrees of freedom
## AIC: 380.36
## 
## Number of Fisher Scoring iterations: 4
# Coefficients and confidence intervals (note: these are on the log-odds scale; exponentiate them to get odds ratios)
OR <- coef(model)
CI <- confint(model)
## Waiting for profiling to be done...
ORCI <- cbind(OR, CI)
print(ORCI, digits = 1)
##                OR 2.5 % 97.5 %
## (Intercept) -1.15 -2.57   0.24
## absences     0.07  0.02   0.11
## famrel      -0.36 -0.63  -0.09
## goout        0.76  0.53   1.01
## studytime   -0.57 -0.92  -0.25

All four hypothesized predictors are statistically significant in the model (p<0.01), indicating that each is relevant for predicting alcohol consumption. Note that the table above shows coefficients on the log-odds scale, so exponentiating them gives the odds ratios. (1) Absences: each additional class absence multiplies the odds of being a high user by about exp(0.067) ≈ 1.07 (95% CI of the coefficient: 0.02 to 0.11). (2) Quality of family relationships: each one-unit improvement multiplies the odds by about exp(-0.359) ≈ 0.70, i.e. the odds decrease (95% CI: -0.63 to -0.09). (3) Going out with friends: each additional level roughly doubles the odds of being a high user, exp(0.760) ≈ 2.14 (95% CI: 0.53 to 1.01). (4) Weekly study time: each additional level multiplies the odds by about exp(-0.572) ≈ 0.56 (95% CI: -0.92 to -0.25). These findings are consistent with the stated hypotheses.
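Because `coef()` returns log-odds, exponentiation is needed before reading the estimates as odds ratios. A sketch using the printed estimates:

```r
# Convert the logistic-regression coefficients (log-odds) to odds ratios.
coefs <- c(absences = 0.06655, famrel = -0.35898,
           goout = 0.75990, studytime = -0.57194)
round(exp(coefs), 2)
# absences 1.07, famrel 0.70, goout 2.14, studytime 0.56
```

Applying `exp()` to the confidence-interval columns in the same way turns them into intervals for the odds ratios.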

5 Predicted probabilities and cross tabulation

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy.

# Explore the predictive power of the model
probabilities <- predict(model, type = "response")
# Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# See the last ten original classes, predicted probabilities, and class predictions
select(alc, absences, famrel, goout, studytime, high_use, probability, prediction) %>% tail(10)
##     absences famrel goout studytime high_use probability prediction
## 361        3      4     3         2    FALSE  0.22224385      FALSE
## 362        0      4     2         1    FALSE  0.16242879      FALSE
## 363        7      5     3         1     TRUE  0.31573116      FALSE
## 364        1      5     3         3    FALSE  0.08975198      FALSE
## 365        6      4     3         1    FALSE  0.38200795      FALSE
## 366        2      5     2         3    FALSE  0.04697548      FALSE
## 367        2      4     4         2    FALSE  0.36371174      FALSE
## 368        3      1     1         2    FALSE  0.15505370      FALSE
## 369        4      2     5         1     TRUE  0.83529411       TRUE
## 370        2      4     1         1     TRUE  0.09388808      FALSE
# Tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% 
  addmargins
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   233   26 259
##    TRUE     63   48 111
##    Sum     296   74 370
# Display a graphic visualizing actual high_use and predictions
p5 <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+
  geom_point()
p5

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%
  prop.table()%>%
  addmargins()%>%
  print(digits = 2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.63 0.07 0.70
##    TRUE   0.17 0.13 0.30
##    Sum    0.80 0.20 1.00

Among the 259 participants who are not alcohol high-users, the model correctly classifies 233 (90%). Among the 111 participants who are high-users, it correctly classifies only 48 (43%). In total, 89 of the 370 predictions (24%) are inaccurate.
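These percentages can be recomputed directly from the 2x2 table, together with a naive guessing baseline (a sketch; the counts are copied from the cross tabulation above):

```r
# Rows: actual high_use FALSE/TRUE; columns: prediction FALSE/TRUE.
tab <- matrix(c(233, 63, 26, 48), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
specificity <- tab["FALSE", "FALSE"] / sum(tab["FALSE", ])  # 233/259, about 0.90
sensitivity <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])     # 48/111, about 0.43
train_error <- (tab["FALSE", "TRUE"] + tab["TRUE", "FALSE"]) / sum(tab)  # 89/370
# Always guessing "not a high user" would misclassify all 111 high users:
guess_error <- sum(tab["TRUE", ]) / sum(tab)                # 111/370 = 0.30
```

So the model's roughly 24% training error clearly beats the 30% error of always guessing the majority class, though its sensitivity for actual high users remains modest.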

6 10-fold cross-validation (bonus)

Bonus: Perform 10-fold cross-validation on your model. Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model?

# Define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2405405
# K-fold cross-validation
library(boot)
set.seed(1)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
# average number of wrong predictions in the cross validation
delta.percent <- paste0((cv$delta[1]|> round(4))*100, "%") 

The training error rate is about 24%, already better than the model in Exercise Set 3, which had about 26% error. According to the 10-fold cross-validation, the model has an average test error of 23.51%, slightly lower than the training error.

7 The relationship between prediction error and number of predictors (super bonus)

Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.
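One way this could be done is sketched below (not run in the source; it assumes the `alc` data, the predictors used above, and `loss_func` are available in the session, and the drop-out ordering of predictors is a hypothetical choice):

```r
# Sketch: refit the logistic model with progressively fewer predictors and
# record the training error and the 10-fold CV (testing) error for each.
library(boot)
library(ggplot2)

preds <- c("goout", "studytime", "absences", "famrel")  # dropped from the right
errs <- data.frame()
set.seed(1)
for (k in length(preds):1) {
  f <- reformulate(preds[1:k], response = "high_use")
  m <- glm(f, data = alc, family = "binomial")
  train <- loss_func(alc$high_use, predict(m, type = "response"))
  test  <- cv.glm(alc, cost = loss_func, glmfit = m, K = 10)$delta[1]
  errs  <- rbind(errs, data.frame(n_predictors = k, train = train, test = test))
}

ggplot(errs, aes(x = n_predictors)) +
  geom_line(aes(y = train, colour = "training error")) +
  geom_line(aes(y = test,  colour = "testing error")) +
  labs(x = "Number of predictors", y = "Prediction error", colour = NULL)
```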


Chapter 4: Clustering and classification

date()
## [1] "Sun Nov 26 15:23:08 2023"

1 Load the Boston data

Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset concerns housing values in the suburbs of Boston. There are 506 observations of 14 variables, including numeric and integer variables. The 14 variables are:

1. crim: per capita crime rate by town.
2. zn: proportion of residential land zoned for lots over 25,000 sq.ft.
3. indus: proportion of non-retail business acres per town.
4. chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. nox: nitrogen oxides concentration (parts per 10 million).
6. rm: average number of rooms per dwelling.
7. age: proportion of owner-occupied units built prior to 1940.
8. dis: weighted mean of distances to five Boston employment centres.
9. rad: index of accessibility to radial highways.
10. tax: full-value property-tax rate per $10,000.
11. ptratio: pupil-teacher ratio by town.
12. black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
13. lstat: lower status of the population (percent).
14. medv: median value of owner-occupied homes in $1000s.

2 Graphical overview

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

# Show summaries of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# First make a distribution plot for 14 variables
library(ggplot2)
library(tidyr)

# Convert the data to long format for easier plotting
long_boston <- gather(Boston)

# Plotting
p1 <- ggplot(long_boston, aes(x = value)) +
  geom_density(fill = "skyblue", color = "black") +
  facet_wrap(~key, scales = "free") +
  theme_minimal() +
  labs(title = "Overview of Boston dataset")

# Print the plot
p1

# Then make a correlations plot to look at the correlations among the 14 variables in the data
library(corrplot)
## corrplot 0.92 loaded
# Calculate the correlation matrix and round it
cor_matrix <- cor(Boston) |> 
  round(digits = 2)

# Print the correlation matrix
print(cor_matrix)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# Visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")

From the distribution plots, most of the variables are skewed to the left or right; only rm is relatively normally distributed. From the correlation matrix, the strongest correlations are between rad and tax (positive), dis and age (negative), dis and nox (negative), dis and indus (negative), and lstat and medv (negative).
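Instead of scanning the matrix by eye, the strongest pairs can also be ranked programmatically. A small sketch, assuming the cor_matrix object computed above:

```r
# Sketch: rank variable pairs by absolute correlation (uses cor_matrix above)
cc <- as.data.frame(as.table(cor_matrix))              # long format: Var1, Var2, Freq
cc <- subset(cc, as.integer(Var1) < as.integer(Var2))  # keep each pair only once
head(cc[order(-abs(cc$Freq)), ], 5)                    # top 5 strongest pairs
```

This lists each unordered pair once, sorted by the magnitude of the correlation.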

3 Standardize the dataset; create a categorical variable of the crime rate; divide the dataset to train and test sets

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

3.1 Standardize the dataset

# Center and standardize variables
boston_scaled <- scale(Boston) |> 
  as.data.frame()

# Summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling the data, the mean of every variable is zero. Most values range between -4 and 4, the exception being crim (crime rate), which still has a long right tail (max ≈ 9.9).
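As a quick sanity check (a minimal sketch), scale() with its defaults is equivalent to subtracting the column mean and dividing by the column standard deviation:

```r
# scale() equals (x - mean(x)) / sd(x), column by column
library(MASS)
data("Boston")
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston)[, "crim"]), manual)
## [1] TRUE
```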

3.2 Create a categorical variable of the crime rate

# Summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# Create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low","med_high", "high"))

# Look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

3.3 Divide the dataset to train and test sets

library(dplyr)

# Number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Choose randomly 80% of the rows
ind <- sample(n, size = floor(n * 0.8))

# Create train set
train <- boston_scaled[ind,]

# Create test set 
test <- boston_scaled[-ind,]

4 Linear discriminant analysis

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

# Fit the linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# Print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2450495 0.2400990 0.2599010 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.94768689 -0.9154164 -0.15765625 -0.8664804  0.49324129 -0.8330934
## med_low  -0.06128399 -0.2905159 -0.03371693 -0.5543909 -0.09719511 -0.3794097
## med_high -0.37054365  0.2063740  0.05238023  0.3768264  0.15156095  0.3976410
## high     -0.48724019  1.0149946 -0.04735191  1.0267822 -0.39946717  0.8122921
##                 dis        rad        tax    ptratio       black        lstat
## low       0.8750206 -0.6897369 -0.7780503 -0.4386754  0.38100748 -0.767876258
## med_low   0.3826266 -0.5456857 -0.4654718 -0.1017024  0.31146285 -0.162813217
## med_high -0.4097682 -0.3756701 -0.2603363 -0.2604135  0.09270391  0.006875968
## high     -0.8535887  1.6596029  1.5294129  0.8057784 -0.76867030  0.834420910
##                medv
## low       0.5582392
## med_low   0.0121384
## med_high  0.2007779
## high     -0.6528397
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.11885002  0.66122613 -0.98747678
## indus   -0.04130912 -0.23640126  0.23839579
## chas    -0.07358997 -0.01543395  0.21302462
## nox      0.33296411 -0.76745164 -1.35199275
## rm      -0.08676976 -0.07017084 -0.15558456
## age      0.29552614 -0.15968266 -0.24988799
## dis     -0.11193174 -0.11699745  0.17008638
## rad      3.12327648  1.09445683 -0.22148103
## tax      0.10112350 -0.22764844  0.78165323
## ptratio  0.14306963  0.10445909 -0.33126409
## black   -0.15592597 -0.01390650  0.09298486
## lstat    0.17766584 -0.24785436  0.32723953
## medv     0.17126083 -0.36432554 -0.26921116
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9537 0.0335 0.0128
# Draw the LDA (bi)plot
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crime)

# Plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

A biplot based on LD1 and LD2 was generated. In the LDA biplot, the low, med_low, and med_high clusters separate poorly, with heavy overlap between the three. The high cluster separates well, although it still overlaps noticeably with med_high. Based on the arrows, the variable rad contributes the most to separating the high cluster; the contributions of the variables to the other clusters are hard to judge because of the heavy overlap.

5 Predict LDA

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Predict classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       8        0    0
##   med_low    5      16        6    0
##   med_high   0      15       14    0
##   high       0       0        1   21

The cross-tabulated results show that the high class is predicted almost perfectly (21 of 22 correct), and the low and med_low classes are mostly correct. The med_high class is the weakest, with fewer than half of its observations classified correctly (14 of 29), mainly being confused with med_low. Overall, the predictive performance of the linear discriminant analysis is only moderately satisfactory.
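The overall misclassification rate can be computed directly from the predictions (a one-line sketch, assuming the correct_classes and lda.pred objects from the chunk above):

```r
# Share of test observations that were classified incorrectly
mean(correct_classes != lda.pred$class)
```

For the cross-tabulation above, this equals 35/102 ≈ 0.34, i.e. about a third of the test observations are misclassified.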

6 Distance measure

Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.

6.1 Reload the Boston dataset and standardize the dataset

data("Boston")
boston_scaled <- scale(Boston) |> 
  as.data.frame()

6.2 Calculate the distances between the observations

# Euclidean distances matrix
dist_eu <- dist(boston_scaled)

# Look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

6.3 Determine K-means for the optimal number of clusters

set.seed(123)
k_max <- 10

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()

The optimal number of clusters is where the total WCSS drops sharply (the "elbow" of the curve). In this case, two clusters seem optimal.

6.4 Visualize the clusters and interpret the results

# K-means clustering with 2 clusters
km <- kmeans(boston_scaled, centers = 2)

# Plot the scaled Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

pairs(boston_scaled[1:7], col = km$cluster)

pairs(boston_scaled[8:14], col = km$cluster)

Most of the variables act as reasonably good linear separators for the two clusters, with the exceptions of rad, ptratio, and tax.

7 Super-Bonus: 3D plot

Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points. Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# Install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p3 <- plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        color = train$crime, #Set the color to be the crime classes of the train set. 
        type= 'scatter3d', 
        mode='markers',
        size = 2)
p3
# Draw another 3D plot where the color is defined by the clusters of the k-means
# Get the clusters of k-means for the train set
train.km <- kmeans(model_predictors, centers = 2) 

p4 <- plot_ly(x = matrix_product$LD1, 
        y = matrix_product$LD2, 
        z = matrix_product$LD3, 
        type= 'scatter3d', 
        mode='markers', 
        color = factor(train.km$cluster), #color defined by clusters of the k-means
        size = 2)
p4
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

The LDA was trained on a categorical version of the crime rate (quantile-based), which has four categories, whereas k = 2 was chosen for the k-means clustering based on the total within-cluster sum of squares. Since LDA is a supervised technique, we know what each category represents, and the categories are labelled in the legend. K-means is an unsupervised method, so the real-world meaning of the two clusters is unknown before inspecting them closely. Comparing the two 3D plots, however, it is interesting that cluster 2 from k-means overlaps nicely with the high category from LDA, while cluster 1 roughly covers the other three categories.